Retain zero-mutation samples #44

dhimmel · 2018-04-12T22:11:21Z

Note that this may not retain every sample that was sequenced but has no mutations. However, it does retain all samples that Xena indicates have been sequenced.

Refs cognoma#43 (comment)

dhimmel · 2018-04-12T22:16:53Z

A small code change, visible in the diff for scripts/2.TCGA-process.py, retains samples that:

have mutation calls (potentially silent mutations)
but do not have red or blue mutations.

This increases the complete mutation matrix to 9104 samples from 9093 and the aligned mutation matrix to 8397 from 8388. So we're gaining 9 samples for cognoma.
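For readers who want to see the idea in isolation, here is a minimal sketch of keeping sequenced samples that end up with no retained mutations. It is not the diff from this PR: the `calls` table, its column names, and the "not silent" filter are all assumptions standing in for whatever scripts/2.TCGA-process.py actually uses.

```python
import pandas as pd

# Hypothetical long-format mutation calls: one row per (sample, gene) call.
calls = pd.DataFrame({
    "sample_id": ["S1", "S1", "S2", "S3"],
    "gene": ["TP53", "KRAS", "TP53", "BRAF"],
    "effect": ["missense", "silent", "silent", "nonsense"],
})

# Any sample with at least one call of any kind is treated as sequenced.
sequenced = sorted(calls["sample_id"].unique())

# Assumed stand-in filter: keep only calls that are not silent.
kept = calls[calls["effect"] != "silent"]

# Binary sample-by-gene matrix built from the kept calls only.
matrix = pd.crosstab(kept["sample_id"], kept["gene"]).clip(upper=1)

# Reindex so sequenced samples without kept calls become all-zero rows
# instead of disappearing from the matrix.
matrix = matrix.reindex(sequenced, fill_value=0)
print(matrix)
```

In this toy input, S2 only has a silent call, so it would vanish from the matrix without the final `reindex` step.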

You can see specific samples added in the diff for data/samples.tsv. None are ovarian cancer.

Supersedes cognoma#45

dhimmel · 2018-04-13T18:02:26Z

In 9f9f675 I added a diseases.tsv file under data with summary information for each cancer. It's useful for tracking sample numbers by cancer type for the various datasets.
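As a rough illustration of what a per-cancer summary like this could look like, the sketch below counts samples by acronym for each dataset. The `samples` table and every column name in it are hypothetical; the real data/diseases.tsv may be built differently and contain other fields.

```python
import pandas as pd

# Hypothetical per-sample table; the real data/samples.tsv columns may differ.
samples = pd.DataFrame({
    "sample_id": ["A-01", "B-01", "C-01", "D-01"],
    "acronym": ["BRCA", "BRCA", "SKCM", "SKCM"],
    "has_mutation_data": [True, True, True, False],
    "has_expression_data": [True, False, True, True],
})

summary = (
    samples
    # Flag samples that appear in both datasets.
    .assign(has_both=lambda df: df["has_mutation_data"] & df["has_expression_data"])
    .groupby("acronym")
    .agg(
        n_samples=("sample_id", "count"),
        n_mutation=("has_mutation_data", "sum"),
        n_expression=("has_expression_data", "sum"),
        n_both=("has_both", "sum"),
    )
    .reset_index()
)

# Print as tab-separated text, the same format a .tsv file would hold.
print(summary.to_csv(sep="\t", index=False))
```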

gwaybio

One typo

The updates look good. The resulting binary matrix represents a list of high confidence mutation calls for all samples with matching mutation, gene expression, and clinical data.

I am curious, are there any samples in either mutation or gene expression data that are not in the clinical data? This seems unlikely, but we probably don't want to filter these samples either (we can infer its acronym.)

gwaybio · 2018-04-14T21:08:06Z

scripts/2.TCGA-process.py

+    sample_ids is a pandas.Series
+    """
+    sample_ids = pandas.Series(sample_ids)
+    aconyms = sample_ids.map(sample_to_acronym)


aconyms --> acronyms

dhimmel · 2018-04-16T14:05:51Z

are there any samples in either mutation or gene expression data that are not in the clinical data?

I used the following code:

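# Samples present in both the mutation and expression matrices
# but absent from the clinical matrix
shared_samples = gene_mutation_mat_df.index.intersection(expr_df.index)
samples_missing_clinical = sorted(shared_samples.difference(clinmat_df.index))
for sample_id in samples_missing_clinical:
    print(sample_id)
len(samples_missing_clinical)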

Turns out there are 389 missing samples:

mutation or expression samples missing clinical data

TCGA-28-2510-01
TCGA-2G-AAKO-01
TCGA-2G-AALF-01
TCGA-2G-AALG-01
TCGA-2G-AALN-01
TCGA-2G-AALO-01
TCGA-2G-AALQ-01
TCGA-2G-AALR-01
TCGA-2G-AALS-01
TCGA-2G-AALT-01
TCGA-2G-AALW-01
TCGA-2G-AALX-01
TCGA-2G-AALY-01
TCGA-2G-AALZ-01
TCGA-2G-AAM2-01
TCGA-2G-AAM3-01
TCGA-2G-AAM4-01
TCGA-3N-A9WB-06
TCGA-3N-A9WC-06
TCGA-3N-A9WD-06
TCGA-5M-AAT5-01
TCGA-5M-AATA-01
TCGA-AR-A0U1-01
TCGA-BF-AAP0-06
TCGA-BH-A0HN-01
TCGA-C4-A0EZ-01
TCGA-C4-A0F1-01
TCGA-C4-A0F7-01
TCGA-D3-A1Q1-06
TCGA-D3-A1Q3-06
TCGA-D3-A1Q4-06
TCGA-D3-A1Q5-06
TCGA-D3-A1Q6-06
TCGA-D3-A1Q7-06
TCGA-D3-A1Q8-06
TCGA-D3-A1Q9-06
TCGA-D3-A1QA-06
TCGA-D3-A1QB-06
TCGA-D3-A2J6-06
TCGA-D3-A2J7-06
TCGA-D3-A2J8-06
TCGA-D3-A2J9-06
TCGA-D3-A2JA-06
TCGA-D3-A2JB-06
TCGA-D3-A2JC-06
TCGA-D3-A2JD-06
TCGA-D3-A2JE-06
TCGA-D3-A2JF-06
TCGA-D3-A2JG-06
TCGA-D3-A2JH-06
TCGA-D3-A2JK-06
TCGA-D3-A2JL-06
TCGA-D3-A2JN-06
TCGA-D3-A2JO-06
TCGA-D3-A2JP-06
TCGA-D3-A3BZ-06
TCGA-D3-A3C1-06
TCGA-D3-A3C3-06
TCGA-D3-A3C6-06
TCGA-D3-A3C7-06
TCGA-D3-A3C8-06
TCGA-D3-A3CB-06
TCGA-D3-A3CC-06
TCGA-D3-A3CE-06
TCGA-D3-A3CF-06
TCGA-D3-A3ML-06
TCGA-D3-A3MO-06
TCGA-D3-A3MR-06
TCGA-D3-A3MU-06
TCGA-D3-A3MV-06
TCGA-D3-A51E-06
TCGA-D3-A51F-06
TCGA-D3-A51G-06
TCGA-D3-A51H-06
TCGA-D3-A51J-06
TCGA-D3-A51K-06
TCGA-D3-A51N-06
TCGA-D3-A51R-06
TCGA-D3-A51T-06
TCGA-D3-A5GL-06
TCGA-D3-A5GN-06
TCGA-D3-A5GO-06
TCGA-D3-A5GR-06
TCGA-D3-A5GS-06
TCGA-D3-A5GU-06
TCGA-D3-A8GB-06
TCGA-D3-A8GC-06
TCGA-D3-A8GD-06
TCGA-D3-A8GE-06
TCGA-D3-A8GI-06
TCGA-D3-A8GJ-06
TCGA-D3-A8GK-06
TCGA-D3-A8GL-06
TCGA-D3-A8GM-06
TCGA-D3-A8GN-06
TCGA-D3-A8GO-06
TCGA-D3-A8GP-06
TCGA-D3-A8GQ-06
TCGA-D3-A8GR-06
TCGA-D3-A8GS-06
TCGA-D3-A8GV-06
TCGA-D9-A148-06
TCGA-D9-A149-06
TCGA-D9-A1JW-06
TCGA-D9-A1JX-06
TCGA-D9-A1X3-06
TCGA-D9-A3Z1-06
TCGA-D9-A3Z3-06
TCGA-D9-A4Z6-06
TCGA-D9-A6E9-06
TCGA-D9-A6EA-06
TCGA-D9-A6EC-06
TCGA-D9-A6EG-06
TCGA-DA-A1HV-06
TCGA-DA-A1HW-06
TCGA-DA-A1HY-06
TCGA-DA-A1I0-06
TCGA-DA-A1I1-06
TCGA-DA-A1I2-06
TCGA-DA-A1I4-06
TCGA-DA-A1I5-06
TCGA-DA-A1I7-06
TCGA-DA-A1I8-06
TCGA-DA-A1IA-06
TCGA-DA-A1IB-06
TCGA-DA-A1IC-06
TCGA-DA-A3F3-06
TCGA-DA-A3F5-06
TCGA-DA-A3F8-06
TCGA-DA-A95V-06
TCGA-DA-A95W-06
TCGA-DA-A95X-06
TCGA-DA-A95Y-06
TCGA-DA-A95Z-06
TCGA-DD-A116-01
TCGA-EB-A44Q-06
TCGA-EB-A44R-06
TCGA-EB-A5KH-06
TCGA-EB-A5SG-06
TCGA-EB-A5SH-06
TCGA-EB-A5UL-06
TCGA-EB-A5UN-06
TCGA-EB-A6L9-06
TCGA-EE-A17X-06
TCGA-EE-A17Y-06
TCGA-EE-A17Z-06
TCGA-EE-A180-06
TCGA-EE-A181-06
TCGA-EE-A182-06
TCGA-EE-A183-06
TCGA-EE-A184-06
TCGA-EE-A185-06
TCGA-EE-A20B-06
TCGA-EE-A20C-06
TCGA-EE-A20F-06
TCGA-EE-A20H-06
TCGA-EE-A20I-06
TCGA-EE-A29A-06
TCGA-EE-A29B-06
TCGA-EE-A29C-06
TCGA-EE-A29D-06
TCGA-EE-A29E-06
TCGA-EE-A29G-06
TCGA-EE-A29H-06
TCGA-EE-A29L-06
TCGA-EE-A29M-06
TCGA-EE-A29N-06
TCGA-EE-A29P-06
TCGA-EE-A29Q-06
TCGA-EE-A29R-06
TCGA-EE-A29S-06
TCGA-EE-A29T-06
TCGA-EE-A29V-06
TCGA-EE-A29W-06
TCGA-EE-A29X-06
TCGA-EE-A2A0-06
TCGA-EE-A2A1-06
TCGA-EE-A2A2-06
TCGA-EE-A2A5-06
TCGA-EE-A2A6-06
TCGA-EE-A2GB-06
TCGA-EE-A2GC-06
TCGA-EE-A2GD-06
TCGA-EE-A2GE-06
TCGA-EE-A2GH-06
TCGA-EE-A2GI-06
TCGA-EE-A2GJ-06
TCGA-EE-A2GK-06
TCGA-EE-A2GL-06
TCGA-EE-A2GM-06
TCGA-EE-A2GN-06
TCGA-EE-A2GO-06
TCGA-EE-A2GP-06
TCGA-EE-A2GR-06
TCGA-EE-A2GS-06
TCGA-EE-A2GT-06
TCGA-EE-A2GU-06
TCGA-EE-A2M5-06
TCGA-EE-A2M6-06
TCGA-EE-A2M7-06
TCGA-EE-A2M8-06
TCGA-EE-A2MC-06
TCGA-EE-A2MD-06
TCGA-EE-A2ME-06
TCGA-EE-A2MF-06
TCGA-EE-A2MG-06
TCGA-EE-A2MH-06
TCGA-EE-A2MI-06
TCGA-EE-A2MJ-06
TCGA-EE-A2MK-06
TCGA-EE-A2ML-06
TCGA-EE-A2MM-06
TCGA-EE-A2MN-06
TCGA-EE-A2MP-06
TCGA-EE-A2MQ-06
TCGA-EE-A2MR-06
TCGA-EE-A2MS-06
TCGA-EE-A2MT-06
TCGA-EE-A2MU-06
TCGA-EE-A3AA-06
TCGA-EE-A3AB-06
TCGA-EE-A3AC-06
TCGA-EE-A3AD-06
TCGA-EE-A3AE-06
TCGA-EE-A3AF-06
TCGA-EE-A3AG-06
TCGA-EE-A3AH-06
TCGA-EE-A3J3-06
TCGA-EE-A3J4-06
TCGA-EE-A3J5-06
TCGA-EE-A3J7-06
TCGA-EE-A3J8-06
TCGA-EE-A3JA-06
TCGA-EE-A3JB-06
TCGA-EE-A3JD-06
TCGA-EE-A3JE-06
TCGA-EE-A3JH-06
TCGA-EE-A3JI-06
TCGA-ER-A193-06
TCGA-ER-A195-06
TCGA-ER-A197-06
TCGA-ER-A198-06
TCGA-ER-A199-06
TCGA-ER-A19A-06
TCGA-ER-A19B-06
TCGA-ER-A19C-06
TCGA-ER-A19D-06
TCGA-ER-A19E-06
TCGA-ER-A19F-06
TCGA-ER-A19G-06
TCGA-ER-A19H-06
TCGA-ER-A19J-06
TCGA-ER-A19L-06
TCGA-ER-A19M-06
TCGA-ER-A19N-06
TCGA-ER-A19O-06
TCGA-ER-A19P-06
TCGA-ER-A19Q-06
TCGA-ER-A19S-06
TCGA-ER-A19W-06
TCGA-ER-A1A1-06
TCGA-ER-A2NC-06
TCGA-ER-A2ND-06
TCGA-ER-A2NE-06
TCGA-ER-A2NG-06
TCGA-ER-A2NH-06
TCGA-ER-A3ES-06
TCGA-ER-A3ET-06
TCGA-ER-A3EV-06
TCGA-ER-A3PL-06
TCGA-ER-A42K-06
TCGA-ER-A42L-06
TCGA-F5-6810-01
TCGA-FR-A3YN-06
TCGA-FR-A3YO-06
TCGA-FR-A44A-06
TCGA-FR-A69P-06
TCGA-FR-A729-06
TCGA-FR-A7U8-06
TCGA-FR-A7U9-06
TCGA-FR-A7UA-06
TCGA-FR-A8YC-06
TCGA-FR-A8YD-06
TCGA-FR-A8YE-06
TCGA-FS-A1YW-06
TCGA-FS-A1YX-06
TCGA-FS-A1YY-06
TCGA-FS-A1Z0-06
TCGA-FS-A1Z3-06
TCGA-FS-A1Z4-06
TCGA-FS-A1Z7-06
TCGA-FS-A1ZA-06
TCGA-FS-A1ZB-06
TCGA-FS-A1ZC-06
TCGA-FS-A1ZD-06
TCGA-FS-A1ZE-06
TCGA-FS-A1ZF-06
TCGA-FS-A1ZG-06
TCGA-FS-A1ZH-06
TCGA-FS-A1ZJ-06
TCGA-FS-A1ZK-06
TCGA-FS-A1ZM-06
TCGA-FS-A1ZP-06
TCGA-FS-A1ZQ-06
TCGA-FS-A1ZR-06
TCGA-FS-A1ZS-06
TCGA-FS-A1ZT-06
TCGA-FS-A1ZU-06
TCGA-FS-A1ZW-06
TCGA-FS-A1ZY-06
TCGA-FS-A1ZZ-06
TCGA-FS-A4F0-06
TCGA-FS-A4F4-06
TCGA-FS-A4F5-06
TCGA-FS-A4F8-06
TCGA-FS-A4F9-06
TCGA-FS-A4FB-06
TCGA-FS-A4FC-06
TCGA-FS-A4FD-06
TCGA-FW-A3I3-06
TCGA-FW-A3R5-06
TCGA-FW-A3TU-06
TCGA-FW-A3TV-06
TCGA-FW-A5DY-06
TCGA-GF-A3OT-06
TCGA-GF-A4EO-06
TCGA-GF-A6C8-06
TCGA-GF-A6C9-06
TCGA-GN-A262-06
TCGA-GN-A264-06
TCGA-GN-A265-06
TCGA-GN-A266-06
TCGA-GN-A267-06
TCGA-GN-A268-06
TCGA-GN-A26A-06
TCGA-GN-A26D-06
TCGA-GN-A4U3-06
TCGA-GN-A4U4-06
TCGA-GN-A4U7-06
TCGA-GN-A4U8-06
TCGA-GN-A4U9-06
TCGA-GN-A8LK-06
TCGA-GN-A8LL-06
TCGA-GN-A9SD-06
TCGA-HR-A2OG-06
TCGA-HR-A2OH-06
TCGA-LH-A9QB-06
TCGA-OD-A75X-06
TCGA-QB-A6FS-06
TCGA-QB-AA9O-06
TCGA-R8-A6YH-01
TCGA-RP-A690-06
TCGA-RP-A693-06
TCGA-RP-A694-06
TCGA-RP-A695-06
TCGA-RP-A6K9-06
TCGA-W3-A824-06
TCGA-W3-A825-06
TCGA-W3-A828-06
TCGA-W3-AA1O-06
TCGA-W3-AA1Q-06
TCGA-W3-AA1R-06
TCGA-W3-AA1V-06
TCGA-W3-AA1W-06
TCGA-W3-AA21-06
TCGA-WE-A8JZ-06
TCGA-WE-A8K1-06
TCGA-WE-A8K5-06
TCGA-WE-A8K6-06
TCGA-WE-A8ZM-06
TCGA-WE-A8ZN-06
TCGA-WE-A8ZO-06
TCGA-WE-A8ZQ-06
TCGA-WE-A8ZR-06
TCGA-WE-A8ZT-06
TCGA-WE-A8ZX-06
TCGA-WE-A8ZY-06
TCGA-WE-AA9Y-06
TCGA-WE-AAA0-06
TCGA-WE-AAA3-06
TCGA-WE-AAA4-06
TCGA-YD-A89C-06
TCGA-YD-A9TA-06
TCGA-YD-A9TB-06
TCGA-YG-AA3O-06
TCGA-YG-AA3P-06
TCGA-Z2-A8RT-06
TCGA-Z2-AA3S-06
TCGA-Z2-AA3V-06

probably don't want to filter these samples either (we can infer its acronym.)

Hmm this seems like an upstream issue and I'd prefer an upstream fix rather than hacking it ourselves. I propose we merge this PR and deal with this issue subsequently.

gwaybio · 2018-04-16T14:23:18Z

Ah, this is interesting, and now that I think about it, totally expected.

If I am remembering correctly (haven't confirmed), the clinical data stores mostly 01 sample types. 01 refers to Primary Solid Tumor (dictionary here). So all of the 06 tumors (Metastatic) will be dropped, even though the clinical data should match on the patient ID rather than the sample ID.
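To make the 01 vs. 06 distinction concrete, here is a small sketch that reads the sample-type code out of sample IDs shaped like the ones listed above. The code table only covers the three types named in this thread, and the helper names are illustrative rather than anything from the repository.

```python
import pandas as pd

# Sample-type codes mentioned in this thread (a subset of the TCGA code table).
SAMPLE_TYPE_CODES = {
    "01": "Primary Solid Tumor",
    "03": "Primary Blood Derived Cancer - Peripheral Blood",
    "06": "Metastatic",
}

# Three IDs taken from the list of missing samples above.
sample_ids = pd.Series([
    "TCGA-28-2510-01",
    "TCGA-3N-A9WB-06",
    "TCGA-D3-A1Q1-06",
])

# The fourth hyphen-separated field holds the two-digit sample-type code.
codes = sample_ids.str.split("-").str[3].str[:2]
sample_types = codes.map(SAMPLE_TYPE_CODES)
print(sample_types.value_counts())
```

Running the same count over the full list of missing IDs would show how many are metastatic versus primary.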

I am not sure where the upstream fix for this should live. Perhaps we should investigate sample-specific vs. patient-specific clinical data. After the first merge on sample_id, we could merge the mutation/gene expression calls on patient ID, retaining only patient-specific fields (age, acronym, etc.) for these samples.
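One way the patient-level merge could look is sketched below, assuming the patient ID is the first 12 characters of the sample ID. The barcodes are made up, and `patient_df`, `age_diagnosed`, and the other names are hypothetical placeholders, not the project's actual tables.

```python
import pandas as pd

# Hypothetical patient-level clinical fields, keyed by patient barcode.
patient_df = pd.DataFrame({
    "patient_id": ["TCGA-XX-0001", "TCGA-XX-0002"],
    "acronym": ["SKCM", "BRCA"],
    "age_diagnosed": [50, 60],
})

# Made-up sample barcodes, including metastatic (06) samples.
samples = pd.DataFrame({
    "sample_id": ["TCGA-XX-0001-06", "TCGA-XX-0002-01", "TCGA-XX-0003-06"],
})

# Assumption: the first 12 characters of a sample ID identify the patient.
samples["patient_id"] = samples["sample_id"].str[:12]

# A left merge keeps every sample; patients without clinical data get NaN.
merged = samples.merge(patient_df, how="left", on="patient_id")
print(merged)
```

With this approach a metastatic sample inherits patient-level fields such as the acronym even when no clinical row exists for its exact sample ID.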

dhimmel · 2018-04-16T14:43:49Z

I think it's simpler to not include metastatic tumors as there are not that many and they may break the independence-between-observations assumption of many classifiers (not sure if that really matters).

gwaybio · 2018-04-16T14:47:09Z

I think it's simpler to not include metastatic tumors as there are not that many

Nearly all of the melanoma tumors (SKCM) are metastatic - these will be dropped if we go this route.

dhimmel · 2018-04-16T17:11:59Z

Turns out there are 389 missing samples:

FYI, I think this is not true and instead results from us filtering by sample types earlier in the notebook:

cancer-data/scripts/2.TCGA-process.py

Lines 165 to 171 in 93e4c53

# Keep only these sample types
# filters duplicate samples per patient
sample_types = {
    'Primary Solid Tumor',
    'Primary Blood Derived Cancer - Peripheral Blood',
}
clinmat_df.query("sample_type in @sample_types", inplace=True)

Hence I gave my comment above a 👎.
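The effect described here can be reproduced on toy data. The sketch below uses made-up barcodes and compares the "missing" samples before and after a sample-type filter. It uses boolean indexing with `isin` as an equivalent of the `query` call quoted above, and the variable names other than `clinmat_df` and `sample_types` are illustrative.

```python
import pandas as pd

# Toy clinical matrix that does contain the metastatic samples.
clinmat_df = pd.DataFrame(
    {"sample_type": ["Primary Solid Tumor", "Metastatic", "Metastatic"]},
    index=["TCGA-XX-0001-01", "TCGA-XX-0002-06", "TCGA-XX-0003-06"],
)
expr_index = pd.Index(["TCGA-XX-0001-01", "TCGA-XX-0002-06", "TCGA-XX-0003-06"])

# Before filtering, nothing is missing from the clinical data.
missing_before = expr_index.difference(clinmat_df.index)

sample_types = {
    "Primary Solid Tumor",
    "Primary Blood Derived Cancer - Peripheral Blood",
}
# Same effect as the quoted filter, written with isin instead of query.
filtered_df = clinmat_df[clinmat_df["sample_type"].isin(sample_types)]

# After filtering, the metastatic samples look like they lack clinical data.
missing_after = expr_index.difference(filtered_df.index)
print(len(missing_before), list(missing_after))
```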

Retain zero-mutation samples

77d8b79

Refs cognoma#43 (comment)

Small doc update

2924585

dhimmel requested a review from gwaybio April 13, 2018 16:12

This was referenced Apr 13, 2018

Current dataset seems to have an issue cognoma/core-service#99

Open

Move diseases.tsv to data #45

Closed

Create data/diseases.tsv with summary info

9f9f675

Supercedes cognoma#45

gwaybio approved these changes Apr 14, 2018

Fix typo

a8a0b5a

dhimmel merged commit 93e4c53 into cognoma:master Apr 16, 2018

dhimmel deleted the zero-mutation-samples branch April 16, 2018 14:09

gwaybio mentioned this pull request Apr 16, 2018

Retain Metastatic Tumors #46

Closed
